Translation Lexicon Estimates from Non-Parallel Corpora Pairs

نویسندگان

  • Markos Mylonakis
  • Khalil Sima’an
چکیده

The estimation of translation lexicon probabilities from parallel corpora is well studied in statistical machine translation. Whenever parallel corpora are not available, it is still possible to obtain unsupervised estimates from pairs of monolingual, non-parallel corpora. In both cases the standard estimator is the Expectation-Maximization (EM) that aims at increasing the likelihood of the source corpus under the translation model. In this paper we study the utility of maximizing the joint likelihood of source and target corpora under a bi-directional translation model. A recently presented bi-directional estimator (BiEM [10]), which maximizes the joint likelihood, constitutes an instance of the EM algorithm. We show that Bi-EM reconciles the asymmetric statistics in the two corpora and leads to better lexicon estimates than standard EM. Our extensive experimental results show that relative to standard EM, our Bi-EM gives substantially better word-to-word translation results across variably related pairs of monolingual corpora.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

Automatic Acquisition of a High-Precision Translation Lexicon from Parallel Chinese-English Corpora

This paper presents a hybrid approach to deriving a translation lexicon from unaligned parallel Chinese-English corpora. Two types of information, namely, proximity and document-external distributions of word pairs, are proposed to enhance the precision of the translation lexicon derived from statistical and dictionary-based methods. The former can identify translations of Chinese compounds, wh...

متن کامل

Evaluating a Pivot-Based Approach for Bilingual Lexicon Extraction

A pivot-based approach for bilingual lexicon extraction is based on the similarity of context vectors represented by words in a pivot language like English. In this paper, in order to show validity and usability of the pivot-based approach, we evaluate the approach in company with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word as...

متن کامل

A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora

We present two problems for statistically extracting bilingual lexicon: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons, from noisy parallel corpora based on arrival distances of words ...

متن کامل

Utilizing Contextually Relevant Terms in Bilingual Lexicon Extraction

This paper demonstrates one efficient technique in extracting bilingual word pairs from non-parallel but comparable corpora. Instead of using the common approach of taking high frequency words to build up the initial bilingual lexicon, we show contextually relevant terms that co-occur with cognate pairs can be efficiently utilized to build a bilingual dictionary. The result shows that our model...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007